NSF PAR Search | NSF Public Access Repository

Borda Regret Minimization for Generalized Linear Dueling Bandits

Wu, Yue; Jin, Tao; Di, Qiwei; Lou, Hao; Farnoud, Farzad; Gu, Quanquan (July 2024, International Conference on Machine Learning (ICML))

Pessimistic Nonlinear Least-Squares Value Iteration for Offline Reinforcement Learning

Di, Qiwei; Zhao, Heyang; He, Jiafan; Gu, Quanquan (May 2024, International Conference on Learning Representations)

Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits

Di, Qiwei; Jin, Tao; Wu, Yue; Zhao, Heyang; Farnoud, Farzad; Gu, Quanquan (May 2024, 12th International Conference on Learning Representations (ICLR))

Variance-aware Regret Bounds for Stochastic Contextual Dueling Bandits

Di, Qiwei; Jin, Tao; Wu, Yue; Zhao, Heyang; Farnoud, Farzad; Gu, Quanquan (May 2024, ICLR)

Dueling bandits is a prominent framework for decision-making with preferential feedback, which suits applications involving human interaction such as ranking, information retrieval, and recommendation systems. While substantial effort has gone into minimizing the cumulative regret in dueling bandits, a notable gap in current research is the absence of regret bounds that account for the inherent uncertainty in pairwise comparisons between the dueling arms. Intuitively, greater uncertainty suggests a harder problem. To bridge this gap, this paper studies contextual dueling bandits in which the binary comparison of dueling arms is generated from a generalized linear model (GLM). We propose a new SupLinUCB-type algorithm that is computationally efficient and enjoys a variance-aware regret bound $$\tilde O\big(d\sqrt{\sum_{t=1}^T\sigma_t^2} + d\big)$$, where $$\sigma_t^2$$ is the variance of the pairwise comparison in round $$t$$, $$d$$ is the dimension of the context vectors, and $$T$$ is the time horizon. Our regret bound aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm suffers only $$\tilde O(d)$$ regret. We perform empirical experiments on synthetic data to confirm the advantage of our method over previous variance-agnostic algorithms.
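To make the quantities in this abstract concrete, the following is a minimal sketch of the comparison model and of the leading term of the bound, not the paper's algorithm. It assumes a logistic link for the GLM (the abstract does not name the link function), treats each comparison as a Bernoulli draw so that its variance is $$p_t(1-p_t)$$, and drops the constants and logarithmic factors hidden by $$\tilde O$$. All function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function (the link function is an assumption)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def comparison_probability(theta, x, y):
    """Probability that arm x is preferred over arm y under the assumed GLM.

    theta is the unknown parameter; x and y are context vectors of equal length.
    """
    z = sum(t * (a - b) for t, a, b in zip(theta, x, y))
    return sigmoid(z)


def bernoulli_variance(p):
    """Variance of a binary comparison whose success probability is p."""
    return p * (1.0 - p)


def variance_aware_rate(d, variances):
    """Leading term d * sqrt(sum_t sigma_t^2) + d, ignoring constants and logs."""
    return d * math.sqrt(sum(variances)) + d


def simulate(theta, duels, seed=0):
    """Draw one binary outcome per duel and record each round's variance."""
    rng = random.Random(seed)
    outcomes, variances = [], []
    for x, y in duels:
        p = comparison_probability(theta, x, y)
        outcomes.append(1 if rng.random() < p else 0)
        variances.append(bernoulli_variance(p))
    return outcomes, variances
```

When every comparison is deterministic, each per-round variance is zero and `variance_aware_rate` reduces to `d`, which matches the $$\tilde O(d)$$ case described above. When every comparison is a fair coin flip, each variance is 0.25 and the term grows like $$d\sqrt{T}$$.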
Borda regret minimization for generalized linear dueling bandits

Wu, Yue; Jin, Tao; Di, Qiwei; Lou, Hao; Farnoud, Farzad; Gu, Quanquan (April 2024, Proc. ICML 2024)

Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path

Di, Qiwei; He, Jiafan; Zhou, Dongruo; Gu, Quanquan (January 2023, International Conference on Machine Learning (ICML))
